Abstract

We derive an improved upper bound for the VC-dimension of neural networks with 
polynomial activation functions. This improved bound is based on a result of Rojas 
[Roj00] on the number of connected components of a semi-algebraic set.
1 Introduction


We examine neural networks with polynomial activation functions. The specific architecture 
of the neural networks is described in detail in the next section. Such neural networks have 
been the subject of active investigation for several years, since powerful tools from algebraic 
geometry can be brought into play in analyzing the VC-dimension of such networks. Perhaps
[GJ93] was the first paper to connect these two subjects. For several years (see for example
[KM97]) it has been known that every bound on the number of connected components
of a semi-algebraic set can be readily translated into a corresponding bound on the
VC-dimension of a neural network architecture. Practically all of the known bounds on the
VC-dimension of neural networks with polynomial activation functions make use of a classical
result discovered by Milnor, Oleinik, Petrovsky, and Thom [OP49, Mil64, Tho65].¹ This
bound, while easy to use, is usually much larger than necessary, since it only uses coarse 
information about the underlying set such as the number of variables and the maximum 
degree of the input polynomials. More recently, sharper bounds using more refined data 
from the input polynomials have been discovered. In the present note, we use a result due to 
Rojas [Roj00] that is particularly well-suited to neural networks with polynomial activation
functions. The present bound is, in all cases, at least as sharp as the earlier bound of Goldberg
and Jerrum [GJ93]. Moreover, it is intuitively appealing, as the improvement can be quantified
as the relative entropy of two probability vectors, whose dimension equals the number of 
layers in the neural network. This shows that the problem of bounding the VC-dimension 
of a neural network architecture continues to be interesting, and that we should strive to 
derive even tighter upper bounds. 

Our main result is stated in Theorem 4 of Section 4. The recent semi-algebraic bound it is
based on is stated as Theorem 3 of Section 3. However, let us first review a bit of background
and some of the earlier bounds.


2 Known Results 


The following definition of the VC-dimension is standard; see for example the books by
Vapnik [Vap95] or Vidyasagar [MV97].


Definition 1 Suppose $X$ is a set and $T$ is a collection of $\{0,1\}$-valued functions on $X$. A
set $S = \{x_1, \ldots, x_n\} \subseteq X$ is said to be shattered by $T$ if each of the $2^n$ functions mapping
$S$ into $\{0,1\}$ is the restriction to $S$ of some function in $T$. The Vapnik-Chervonenkis
(VC)-dimension of $T$ is the largest integer $n$ such that there exists a set of cardinality $n$
that is shattered by $T$. ⋄

¹ Actually, the classical result of Milnor, Oleinik, Petrovsky, and Thom bounds the sum of the Betti
numbers of a semi-algebraic set, and this quantity is always at least as large as the number of
connected components. In practice, one usually only needs an upper bound on the number of
connected components.

By identifying a $\{0,1\}$-valued function with its support set, it is also possible to speak of the
VC-dimension of a collection of sets. In the sequel, we shall use both notions interchangeably. 
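
To make Definition 1 concrete, the following is a minimal brute-force sketch in Python that checks shattering on a small finite domain. The helper names `is_shattered` and `vc_dimension`, and the two toy function classes (thresholds and closed intervals on a four-point domain), are illustrative assumptions and do not appear in the original text.

```python
from itertools import combinations

def is_shattered(points, functions):
    """True if the functions realize all 2^n labelings of the given points."""
    labelings = {tuple(f(x) for x in points) for f in functions}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, functions):
    """Largest n such that some n-point subset of a finite domain is shattered."""
    best = 0
    for n in range(1, len(domain) + 1):
        if any(is_shattered(s, functions) for s in combinations(domain, n)):
            best = n
        else:
            break  # subsets of shattered sets are shattered, so no larger set can work
    return best

domain = [0, 1, 2, 3]
cuts = [-0.5, 0.5, 1.5, 2.5, 3.5]
# Toy class 1 (assumed for illustration): thresholds x -> [x >= t]
thresholds = [lambda x, t=t: int(x >= t) for t in cuts]
# Toy class 2 (assumed for illustration): closed intervals x -> [a <= x <= b]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in cuts for b in cuts if a <= b]

print(vc_dimension(domain, thresholds), vc_dimension(domain, intervals))
```

On this domain the threshold class cannot produce the labeling (1, 0) on any ordered pair of points, while the interval class cannot produce (1, 0, 1) on any triple, which is what limits the two values. The search is exponential in the domain size and is only meant for toy cases.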

Following by now familiar approaches, we view a neural network as a verifier of formulas.
Specifically, let $w \in \mathbb{R}^k$ denote the "weight vector," that is, the vector of adjustable
parameters in a neural network. A neural network with input space $X \subseteq \mathbb{R}^N$ and weight vector
$w$ evaluates a logical proposition $\phi(w, x)$ which is a Boolean combination of $s$ atomic
expressions of the form $\tau_j(w, x) = 0$ or $\tau_j(w, x) > 0$. Letting 1 (resp. 0) denote "true"
(resp. "false"), we can thus think of $\phi$ as a function from $\mathbb{R}^{k+N}$ to $\{0,1\}$. So for each
weight vector $w$, define

$$A_w := \{x \in \mathbb{R}^N : \phi(w, x) = 1\}.$$

The objective is to obtain an upper bound on the VC-dimension of the collection of sets
$\mathcal{A} := \{A_w \mid w \in \mathbb{R}^k\}$ or, equivalently, the VC-dimension of the collection of $\{0,1\}$-valued
functions $\Phi := \{\phi(w, \cdot) \mid w \in \mathbb{R}^k\}$.

To state the result, we need one final bit of notation: Let $x_1, \ldots, x_v \in \mathbb{R}^N$, and suppose
$sv \ge k$. From the $sv$ polynomials $\tau_j(\cdot, x_i)$ determined by all $(i, j) \in \{1, \ldots, v\} \times \{1, \ldots, s\}$,
choose $r \le k$ polynomials, and label them $\theta_1(\cdot), \ldots, \theta_r(\cdot) : \mathbb{R}^k \to \mathbb{R}$. Define

$$f(w) := [\theta_1(w) \ \ldots \ \theta_r(w)] \in \mathbb{R}^r.$$


Finally, let $B$ denote the maximum number of connected components of any pre-image
$f^{-1}(y)$ with $y \in \mathbb{R}^r$, for any choice of $r$ and $\theta_1, \ldots, \theta_r$ as above. With the above set-up,
the following result is proved in [KM97]. For further background on our setting below see
[KM97] or [MV97, p. 329].


Theorem 1 Following the notation above, assume further that we restrict to those $y \in \mathbb{R}^r$ that
are regular values of $f$. Then

$$\mathrm{VCDIM}(\Phi) \le 2 \lg B + 2k \lg(2es). \qquad (1)$$


That $B$ is in fact finite and admits an explicit upper bound is obtained by appealing to
the aforementioned classical result of Milnor, Oleinik, Petrovsky, and Thom [OP49, Mil64,
Tho65], which we now state as follows:


Lemma 1 Suppose $\theta_1, \ldots, \theta_r$ are polynomials in $k$ variables, with degree no larger than $d$.
Then whenever $y$ is a regular value of $f$ as defined above, the preimage $f^{-1}(y)$ contains no
more than $d(2d)^{k-1}$ connected components.
Note that Milnor actually proves the theorem in the case where $y = 0$, but we can clearly
perturb the constant terms of the $\theta_i$ to enforce this assumption. If we replace the quantity
$d(2d)^{k-1}$ by the larger number $(2d)^k$ and substitute $B = (2d)^k$ into the upper bound (1) of
Theorem 1, we get the following result.


Theorem 2 Suppose $\phi$ is a Boolean formula involving a total of $s$ polynomial equalities and
inequalities, where each polynomial has degree no larger than $d$ with respect to $w$. Then

$$\mathrm{VCDIM}(\Phi) \le 2k \lg(4eds).$$


The above result is the same as that derived in [GJ93]. It should be noted, however, that
Goldberg and Jerrum actually consider neural networks with piecewise polynomial activation
functions. With more elaborate notation, their results can be derived as special cases of
Theorem 1.
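
The substitution that turns Theorem 1 into Theorem 2 can be checked numerically. The sketch below is only an illustration: the function names and the sample values of $k$, $d$, $s$ are hypothetical, and the Theorem 1 bound takes $\lg B$ directly so that the (possibly huge) number $B$ is never formed.

```python
import math

def theorem1_bound(lg_B, k, s):
    """2 lg B + 2k lg(2es), with lg B supplied instead of B."""
    return 2 * lg_B + 2 * k * math.log2(2 * math.e * s)

def theorem2_bound(k, d, s):
    """2k lg(4eds), the form obtained after substituting B = (2d)^k."""
    return 2 * k * math.log2(4 * math.e * d * s)

# Hypothetical sizes, chosen only for illustration.
k, d, s = 12, 3, 5
lg_B = k * math.log2(2 * d)  # lg of B = (2d)^k

# 2k lg(2d) + 2k lg(2es) = 2k lg(4eds), up to floating-point rounding.
assert math.isclose(theorem1_bound(lg_B, k, s), theorem2_bound(k, d, s))
```

Any smaller value of $\lg B$ lowers the first term by twice the saving, which is why sharper component counts translate directly into sharper VC-dimension estimates.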

Theorem 1 shows the importance of deriving tight upper bounds on the number of connected
components of a semi-algebraic set. This is a long-standing problem in real algebraic
geometry that has received considerable attention from the research community. It is obvious
from the bound (1) that any improvement over Milnor's upper bound translates directly
into a corresponding improvement in the estimate of the VC-dimension of a neural network
architecture with polynomial activation functions. This leads us to the next topic.


3 Improved Upper Bound on the Number of Connected 
Components 


In [Roj00], an improvement is provided over Milnor's bound. To state this improved result,
a bit of notation is introduced.

Let $\Delta_n$ denote the standard $n$-simplex in $\mathbb{R}^n$, with vertices the standard basis vectors
and the origin. Note that

$$d\Delta_n = \left\{ (x_1, \ldots, x_n) \in \mathbb{R}^n \ \middle|\ x_i \ge 0 \text{ for all } i \text{ and } \sum_{i=1}^{n} x_i \le d \right\}.$$

Let $\mathrm{Vol}_n(\cdot)$ denote the renormalization of the usual volume in $\mathbb{R}^n$ satisfying $\mathrm{Vol}_n(\Delta_n) = 1$.
(Since the usual $n$-dimensional volume is multiplicative for orthogonal subspaces, it is easy
to prove by induction that $\mathrm{Vol}_n$ is just $n!$ times the usual $n$-dimensional volume.)



Theorem 3 Suppose $\theta_1, \ldots, \theta_r$ are polynomials in $(w_1, \ldots, w_k)$, and let $e_1, \ldots, e_k$ and $O$
denote the standard basis vectors and the origin of $\mathbb{R}^k$. Also, let $V$ denote the convex hull of
the union of $\{O, e_1, \ldots, e_k\}$ with the set of all $(i_1, \ldots, i_k)$ such that $w_1^{i_1} \cdots w_k^{i_k}$ is a monomial
of some $\theta_j(\cdot)$. Then

$$B \le 2^k \, \mathrm{Vol}_k(V).$$

In the special case where every $k$-tuple with $i_1 + \cdots + i_k = d$ occurs in $V$, we recover the
(adjusted) Milnor bound $(2d)^k$. However, the whole point of the preceding refined bound is
that there are many instances where the input polynomials are far more sparse, and this can
be exploited.
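
To illustrate how sparsity enters the bound $2^k \mathrm{Vol}_k(V)$, here is a small sketch for $k = 2$ only, where the normalized volume is $2!$ times the area of a planar convex hull. The hull routine is a standard monotone-chain construction; the function names and the sparse monomial set are hypothetical choices made for the example.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def normalized_volume_2d(exponents):
    """Vol_2 of the hull of {O, e1, e2} and the exponent vectors (2! times the area)."""
    hull = convex_hull(list(exponents) + [(0, 0), (1, 0), (0, 1)])
    # The shoelace sum equals twice the area, i.e. the normalized volume in the plane.
    return sum(x1 * y2 - x2 * y1
               for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))

def component_bound_2d(exponents):
    """2^k Vol_k(V) for k = 2."""
    return 4 * normalized_volume_2d(exponents)

d = 3
dense = [(i, j) for i in range(d + 1) for j in range(d + 1 - i)]  # all degrees <= 3
sparse = [(3, 0), (1, 1)]  # hypothetical monomials w1^3 and w1*w2

print(component_bound_2d(dense), component_bound_2d(sparse))
```

For the dense support the hull is the triangle $3\Delta_2$ and the bound coincides with $(2d)^2$; for the sparse support the hull is a smaller quadrilateral, so the bound drops even though the maximum degree is unchanged.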

4 Improved Upper Bound on the VC-Dimension 

In this section, we derive an improved upper bound on the VC-dimension of neural networks
with polynomial activation functions. The improved bound is a direct consequence of
coupling Theorems 1 and 3.

Let us begin by describing the class of neural networks under study. It is assumed that
the network has $N$ real inputs denoted by $x_1, \ldots, x_N$. There are $l$ levels in the network,
and at level $i$ there are $q_i$ output neurons; however, at the output layer (level $l$) there is
only a single neuron (see below). Let $k_i$ denote the number of adjustable parameters, or
"weights," at level $i$, and let $k = \sum_{i=1}^{l} k_i$ denote the total number of adjustable parameters.
Let $w_i := (w_{i,1}, \ldots, w_{i,k_i})$ denote the weight vector at level $i$, and $w = (w_1 \ \ldots \ w_l)$ denote
the total weight vector. The input-output relationship of each neuron at level $i$ is of the
form

$$y_{i,j} = \tau_{i,j}(w_i, y_{i-1,1}, \ldots, y_{i-1,q_{i-1}}), \qquad j = 1, \ldots, q_i,$$

where $y_{i,j}$ is the output of neuron $j$ at level $i$ (the level-0 outputs being the network inputs
$x_1, \ldots, x_N$), and $\tau_{i,j}$ is a polynomial of degree no larger than $\alpha_i$ in the components of the
weight vector $w_i$, and no larger than $\beta_i$ in the components of the vector
$(y_{i-1,1}, \ldots, y_{i-1,q_{i-1}})$. At the final layer, there is a simple perceptron device following the
polynomial activation function.

With this class of neural networks, it is clear that the output will equal one if and only
if a polynomial inequality of the form

$$y_l(w, x) > 0$$

is satisfied, where $w$ is the weight vector and $x = (x_1 \ \ldots \ x_N)$ is the input vector. Thus we
can apply Theorem 1 with $s = 1$. The issue now is to determine the number of connected
components $B$ of the semi-algebraic set defined by $y_l(w, x) = y$.
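
As a concrete and purely illustrative instance of this architecture, the following Python sketch wires up a two-level network with two inputs, two first-level neurons, and a single output neuron followed by a threshold. The particular polynomials and the names `level1`, `level2`, `network` are assumptions made for the example; the text only constrains the degrees, here at most 1 in the weights at each level, at most 2 in the inputs at level 1, and at most 3 in the level-1 outputs at level 2.

```python
def level1(w1, x):
    """Two first-level neurons: linear in the weights w1, quadratic in the inputs x."""
    a, b, c, d = w1
    return (a * x[0] ** 2 + b * x[1], c * x[0] * x[1] + d)

def level2(w2, y1):
    """Single output neuron: linear in the weights w2, cubic in the level-1 outputs."""
    p, q = w2
    return p * y1[0] ** 3 + q * y1[1]

def network(w1, w2, x):
    """Perceptron on top of the output polynomial: 1 iff y_l(w, x) > 0."""
    return 1 if level2(w2, level1(w1, x)) > 0 else 0

# Hypothetical weights and input, chosen only to exercise both outcomes.
w1 = (1, 0, 0, 1)
x = (2, 0)
print(network(w1, (1, -10), x), network(w1, (1, -100), x))
```

Here the output polynomial contains $w_1$ only through the level-1 outputs, so its degree in the components of $w_1$ is at most 3 while its degree in the components of $w_2$ is at most 1. This per-level degree pattern is what the rest of the section exploits.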
Now we are in a position to state the main result. To facilitate the statement, we
introduce a bit of notation. Define

$$d_i := \alpha_i \prod_{j=i+1}^{l} \beta_j, \qquad i = 1, \ldots, l.$$

Recall that $k_i$ denotes the number of adjustable parameters at level $i$, and that $k$ denotes
the total number of adjustable parameters. Define the probability vectors

$$v := (k_1/k, \ldots, k_l/k), \qquad u := (d_1/d, \ldots, d_l/d),$$

and define the "binary" relative entropy $H(v|u)$ as

$$H(v|u) := \sum_{i=1}^{l} v_i \lg\left(\frac{v_i}{u_i}\right) = \sum_{i=1}^{l} v_i \lg\left(\frac{d k_i}{k d_i}\right).$$

Note that the above is the same as the conventional relative entropy of two probability
vectors, except that we use base-2 logarithms instead of natural logarithms. Following
standard convention, we take $0 \lg(0/0) = 0$.

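Theorem 4 With the above notation, we have

$$B \le 2^k \, \frac{k!}{k_1! \cdots k_l!} \prod_{i=1}^{l} d_i^{k_i} \qquad (4.1)$$

$$\phantom{B} \le (2d)^k \, 2^{-k H(v|u)}, \qquad (4.2)$$

where $d := \sum_{i=1}^{l} d_i$ and we assume $k_1, \ldots, k_l \ge 2$ in the last inequality. Consequently, when
$k_1, \ldots, k_l \ge 2$, the VC-dimension of the neural network architecture is bounded above by

$$2k\left(\lg(4ed) - H(v|u)\right).$$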


Remark The above theorem shows that the reduction in the VC-dimension estimate over
that of Theorem 2 is precisely $2k$ times the (binary) relative entropy of the two probability
vectors $v$ and $u$ defined above. Thus if $k_i/k = d_i/d$ for all $i$, there will not be any reduction
at all. In general, the fraction by which the older VC-dimension estimate is reduced is
precisely the ratio $H(v|u)/\lg(4ed)$. Note also that the assumption that there are at least
2 adjustable parameters at each level is a reasonably mild assumption. ⋄
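
As a rough numerical sketch of the quantities involved, the snippet below computes the per-level degrees $d_i = \alpha_i \prod_{j>i} \beta_j$ (the form implied by the degree bookkeeping in the proof) and compares $\lg$ of the product-of-simplices bound $2^k k! \prod_i d_i^{k_i}/k_i!$ with $\lg$ of the dense bound $(2d)^k$. The function names and the three-level sample architecture are hypothetical; logarithms of factorials are taken via `math.lgamma` to avoid forming very large integers.

```python
import math

def level_degrees(alpha, beta):
    """d_i = alpha_i * prod_{j > i} beta_j, with one list entry per level."""
    return [alpha[i] * math.prod(beta[i + 1:]) for i in range(len(alpha))]

def lg_sparse_bound(ks, ds):
    """lg of 2^k * k!/(k_1! ... k_l!) * prod_i d_i^{k_i}."""
    k = sum(ks)
    ln_multinomial = math.lgamma(k + 1) - sum(math.lgamma(ki + 1) for ki in ks)
    lg_powers = sum(ki * math.log2(di) for ki, di in zip(ks, ds))
    return k + ln_multinomial / math.log(2) + lg_powers

def lg_dense_bound(ks, ds):
    """lg of (2d)^k with d = d_1 + ... + d_l."""
    return sum(ks) * math.log2(2 * sum(ds))

# Hypothetical three-level architecture, for illustration only.
alpha, beta, ks = [1, 1, 1], [2, 2, 2], [10, 6, 3]
ds = level_degrees(alpha, beta)
print(ds, lg_sparse_bound(ks, ds), lg_dense_bound(ks, ds))
```

Since the VC-dimension estimate of Theorem 1 contains the term $2 \lg B$, each bit saved in $\lg B$ by the sparse bound lowers the estimate by two.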
Proof of Theorem 4: The proof depends on a careful book-keeping of the degree of
$y_l(w, x)$ with respect to the various components of $w$. From the architecture of the neural
network, it is clear that at the first level, each of the $y_{1,j}$ is a polynomial in the components
of $w_1$ of degree no larger than $\alpha_1$. At the second level, each of the $y_{2,j}$ is a polynomial,
whose monomials are of (combined) degree no larger than $\alpha_2$ in the components of $w_2$, and
of (combined) degree no larger than $\beta_2 \alpha_1$ in the components of $w_1$. Thus, while each $y_{2,j}$
could have a total degree of $\alpha_2 + \beta_2 \alpha_1$ in the components of $w_1$ and $w_2$, the total degree of
the monomial terms involving the components of $w_1$ does not exceed $\beta_2 \alpha_1$, while the total
degree of the monomial terms involving the components of $w_2$ does not exceed $\alpha_2$. A simple
argument by induction then tells us that at the output layer (level $l$), the single output $y_l$ is
a polynomial whose monomials have total degree no larger than $d_l = \alpha_l$ in the components of
$w_l$, no larger than $d_{l-1} = \beta_l \alpha_{l-1}$ in the components of $w_{l-1}$, and so on. With the $d_i$'s defined
as above, the components of each $w_i$ appear with total degree no larger than $d_i$. Thus the
total degree of $y_l$ could be as large as $d$, but the monomial terms involving the components
of $w_i$ have total degree no larger than $d_i$. So the set $V$ defined in Theorem 3 satisfies the
following containment:

$$V \subseteq \prod_{i=1}^{l} d_i \Delta_{k_i}.$$

The usual volume of $d_i \Delta_{k_i}$ is $d_i^{k_i}/k_i!$, and the usual volume is multiplicative over the
orthogonal factors. Because of this containment, it follows that

$$\mathrm{Vol}_k(V) \le k! \prod_{i=1}^{l} \frac{d_i^{k_i}}{k_i!}.$$

Combining this with the bound of Theorem 3 establishes the first estimate (4.1).

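To prove the second estimate, we use Stirling's approximation. In particular, [Rud76, ex.
20, pg. 200] tells us that for all $t \in \{2, 3, 4, \ldots\}$, we have

$$e^{7/8} (t/e)^t \sqrt{t} \le t! \le e \, (t/e)^t \sqrt{t}.$$

Consequently, we easily obtain

$$\frac{k!}{k_1! \cdots k_l!} \le e^{1 - \frac{7l}{8}} \, \frac{k^k \sqrt{k}}{k_1^{k_1} \cdots k_l^{k_l} \sqrt{k_1 \cdots k_l}}.$$

The ratio of square roots can be dropped, since $k_1 \cdots k_l \ge k_1 + \cdots + k_l = k$ whenever every
$k_i \ge 2$, and $e^{1 - 7l/8} \le 1$ for $l \ge 2$ (the case $l = 1$ is immediate). An elementary calculation,
using $\prod_{i=1}^{l} (k d_i / k_i)^{k_i} = d^k \, 2^{-k H(v|u)}$, then yields

$$2^k \, k! \prod_{i=1}^{l} \frac{d_i^{k_i}}{k_i!} \le (2d)^k \, 2^{-k H(v|u)},$$

provided $k_1, \ldots, k_l \ge 2$.

The VC-dimension estimate now follows readily from Theorem 1. □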
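
The two-sided Stirling-type inequality used in the proof can be sanity-checked numerically. The sketch below works with logarithms: dividing through by $(t/e)^t \sqrt{t}$ and taking natural logs, the inequality is equivalent to $7/8 \le \ln t! - \ln\left((t/e)^t \sqrt{t}\right) \le 1$. The helper name `stirling_gap` and the tested range are arbitrary choices for illustration, and a finite check is of course not a proof.

```python
import math

def stirling_gap(t):
    """ln(t!) - ln((t/e)^t * sqrt(t)), computed with lgamma to avoid overflow."""
    return math.lgamma(t + 1) - (t * (math.log(t) - 1) + 0.5 * math.log(t))

gaps = [stirling_gap(t) for t in range(2, 2001)]

# The inequality e^{7/8} (t/e)^t sqrt(t) <= t! <= e (t/e)^t sqrt(t) holds on this range.
assert all(7 / 8 <= g <= 1 for g in gaps)

# The gap decreases toward ln(sqrt(2*pi)), the constant in Stirling's formula.
assert all(a > b for a, b in zip(gaps, gaps[1:]))
print(gaps[0], gaps[-1], 0.5 * math.log(2 * math.pi))
```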


5 Numerical Example 

Consider a network with four inputs, five hidden-layer neurons at the first level and an
output-layer neuron. As is common, let us suppose that $\alpha_i = 1$ for all $i$. This means
that all the adjustable parameters enter linearly into the corresponding activation function.
Suppose $\beta_1 = 2$, $\beta_2 = 3$. This means that the hidden-layer neurons have quadratic activation
functions, whereas the output-layer neuron has a cubic activation function. It remains to
specify the integers $k_1$ and $k_2$, representing the number of adjustable parameters. Let us
assume that practically all of the monomial terms are present in each neural characteristic.
Thus it is reasonable to assume $k_1 = 50$, $k_2 = 20$. Finally, $d_1 = 3$, $d_2 = 1$. With these
figures, one has

$$v = (5/7, 2/7), \qquad u = (0.25, 0.75),$$

$$H(v|u) \approx 0.684033, \qquad \lg(4ed) \approx 5.4427,$$

$$\frac{H(v|u)}{\lg(4ed)} \approx 0.12567.$$

Thus, in this case, the improved bound is roughly 12.5% sharper.
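
The figures of this example can be recomputed with a few lines of Python. The sketch below is an illustration only: the function name `relative_entropy_bits` and the variable names are assumptions, and the vector $u$ is entered exactly as printed above.

```python
import math

def relative_entropy_bits(v, u):
    """Binary relative entropy sum_i v_i lg(v_i / u_i), with 0 lg(0/0) taken as 0."""
    return sum(vi * math.log2(vi / ui) for vi, ui in zip(v, u) if vi > 0)

k = [50, 20]
v = [ki / sum(k) for ki in k]          # (5/7, 2/7)
u_printed = [0.25, 0.75]               # u as printed in the example
d_total = 4                            # d = d_1 + d_2

H = relative_entropy_bits(v, u_printed)
lg_4ed = math.log2(4 * math.e * d_total)
ratio = H / lg_4ed

# Same formula with u built from d_1 = 3, d_2 = 1 via u = (d_1/d, d_2/d).
u_from_d = [3 / d_total, 1 / d_total]
H_from_d = relative_entropy_bits(v, u_from_d)

print(H, lg_4ed, ratio, H_from_d)
```

With $u$ as printed, the sketch reproduces $H(v|u) \approx 0.684$ and a ratio of about 0.1257. Note that the printed values $d_1 = 3$, $d_2 = 1$ would give $u = (0.75, 0.25)$ under the definition $u = (d_1/d, d_2/d)$, for which the same formula returns a much smaller relative entropy, so the ordering of the entries of $u$ in this example may be worth double-checking.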


6 Conclusions 